Probabilistic neural network for predicting hidden context of traffic entities for autonomous vehicles

ABSTRACT

An autonomous vehicle uses probabilistic neural networks to predict hidden context attributes associated with traffic entities. The hidden context represents behavior of the traffic entities in the traffic. The probabilistic neural network is configured to receive an image of traffic as input and generate output representing hidden context for a traffic entity displayed in the image. The system executes the probabilistic neural network to generate output representing hidden context for traffic entities encountered while navigating through traffic. The system determines a measure of uncertainty for the output values. The autonomous vehicle uses the measure of uncertainty generated by the probabilistic neural network during navigation.

CROSS REFERENCE TO RELATED APPLICATIONS

This application claims the benefits of priority under 35 USC 119(e) to U.S. Provisional Application No. 62/802,151 filed on Feb. 6, 2019 and U.S. Provisional Application No. 62/822,269 filed on Mar. 22, 2019, each of which is incorporated herein by reference in its entirety for all purposes.

TECHNICAL FIELD

The present disclosure relates generally to navigating an autonomous vehicle through traffic and more specifically to probabilistic neural networks for predicting hidden context of traffic entities for autonomous vehicles.

BACKGROUND

An autonomous vehicle uses different types of sensors to receive input describing the surroundings of the autonomous vehicle while driving through traffic. For example, an autonomous vehicle may perceive the surroundings using camera images and lidar scans. The autonomous vehicle determines whether an object in the surroundings is stationary, for example, buildings or trees, or the object is non-stationary, for example, a pedestrian, a vehicle, and so on. The autonomous vehicle system predicts the motion of non-stationary objects to make sure that the autonomous vehicle is able to navigate through non-stationary obstacles in the traffic.

Conventional systems predict motion of non-stationary objects using kinematics. For example, autonomous vehicles may rely on methods that make decisions on how to control the vehicles by predicting motion vectors of objects near the vehicles. This is accomplished by collecting data of an object's current and past movements, determining a motion vector of the object at a current time based on these movements, and extrapolating a future motion vector representing the object's predicted motion at a future time based on the current motion vector. However, these techniques fail to predict motion of certain non-stationary objects, for example, pedestrians, bicyclists, and so on. For example, if the autonomous vehicle detects a pedestrian standing at a street corner, the motion of the pedestrian does not help predict whether the pedestrian will cross the street or whether the pedestrian will remain standing at the street corner. Similarly, if the autonomous vehicle detects a bicyclist in a lane, the current motion of the bicycle does not help the autonomous vehicle predict whether the bicycle will change lanes. Failure of autonomous vehicles to accurately predict motion of non-stationary traffic objects results in unnatural movement of the autonomous vehicle, for example, as a result of the autonomous vehicle suddenly stopping due to a pedestrian moving in the road or the autonomous vehicle continuing to wait for a person to cross a street even if the person never intends to cross the street.
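As a non-limiting illustration of the kinematic approach described above, the following sketch extrapolates a future position from the two most recent observations under a constant-velocity assumption. The function name, the two-point velocity estimate, and the sample coordinates are assumptions made for illustration and are not taken from any particular conventional system.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def extrapolate_position(track: List[Point], dt: float, horizon: float) -> Point:
    """Predict a future position from the two most recent observations.

    track   -- observed (x, y) positions, oldest first, spaced dt seconds apart
    horizon -- how far into the future to extrapolate, in seconds
    """
    if len(track) < 2:
        raise ValueError("need at least two observations")
    (x0, y0), (x1, y1) = track[-2], track[-1]
    # Current motion vector estimated from the last two positions.
    vx, vy = (x1 - x0) / dt, (y1 - y0) / dt
    # Constant-velocity extrapolation (an illustrative assumption).
    return (x1 + vx * horizon, y1 + vy * horizon)


# A pedestrian standing at a corner has a zero motion vector, so the
# extrapolation carries no information about an intent to cross.
standing = [(2.0, 5.0), (2.0, 5.0)]
walking = [(0.0, 0.0), (0.5, 0.0)]
print(extrapolate_position(standing, dt=0.5, horizon=2.0))
print(extrapolate_position(walking, dt=0.5, horizon=2.0))
```

The standing pedestrian is predicted to remain in place regardless of intent, which is the limitation the paragraph above describes.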

SUMMARY

Embodiments of the invention use probabilistic neural networks to predict hidden context attributes associated with traffic entities. A traffic entity may represent a pedestrian, a bicyclist, or another vehicle in traffic encountered by a vehicle. The hidden context of a traffic entity represents behavior of the traffic entity in the traffic. The system trains a probabilistic neural network to help in navigating through traffic. The probabilistic neural network is configured to receive an image of traffic as input and generate output representing hidden context for a traffic entity displayed in the image.

The probabilistic neural network generates a feature vector for a plurality of features. The feature vector comprises values describing a statistical distribution for each feature. The generated output of the probabilistic neural network comprises a plurality of values, each value representing a likelihood of receiving a particular user response from a user presented with the image. The training data set may comprise statistical information describing user responses of users presented with images of traffic.

The system receives a new camera image captured by a camera mounted on an autonomous vehicle navigating through traffic. The system executes the probabilistic neural network to generate output representing hidden context for a traffic entity displayed in the new image. The system determines a measure of uncertainty for each of the plurality of values. The autonomous vehicle navigates to avoid the traffic entity identified in the new image. The navigation of the autonomous vehicle is based on at least the measure of uncertainty generated by the probabilistic neural network.

In an embodiment, the values describing the statistical distribution for each feature comprise a mean value and a standard deviation for the feature. The probabilistic neural network is configured to generate samples of features that correspond to their respective distributions. These samples are used to generate different outputs. The distribution of the generated outputs is used to determine a measure of uncertainty of the outputs. The measure of uncertainty is used to navigate the vehicle through traffic, for example, to determine how far to stay from a traffic entity displayed in the image that was provided as input to the probabilistic neural network.
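As a non-limiting illustration, the sampling described above may be sketched as follows. The sketch assumes Gaussian feature distributions, a single linear layer followed by a softmax as the prediction component, and the standard deviation of the sampled outputs as the measure of uncertainty. All names, weights, and numeric values are hypothetical.

```python
import math
import random
import statistics
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw scores into probabilities that sum to one."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_outputs(
    feature_stats: Sequence[Tuple[float, float]],
    weights: Sequence[Sequence[float]],
    n_samples: int,
    rng: random.Random,
) -> List[List[float]]:
    """Draw feature samples from (mean, std) pairs and map each to an output."""
    outputs = []
    for _ in range(n_samples):
        # One sample per feature, drawn from that feature's distribution.
        features = [rng.gauss(mu, sigma) for mu, sigma in feature_stats]
        # Hypothetical prediction component: linear layer plus softmax.
        logits = [sum(w * f for w, f in zip(row, features)) for row in weights]
        outputs.append(softmax(logits))
    return outputs


def summarize_samples(outputs: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Return the per-value mean and spread across the sampled outputs."""
    columns = list(zip(*outputs))
    means = [statistics.fmean(c) for c in columns]
    spreads = [statistics.pstdev(c) for c in columns]
    return means, spreads


rng = random.Random(0)
feature_stats = [(0.2, 0.1), (1.0, 0.3)]            # (mean, std) per feature
weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]      # three possible responses
means, spreads = summarize_samples(sample_outputs(feature_stats, weights, 200, rng))
print(means, spreads)
```

A wider spread for a given output value indicates that the sampled outputs disagree more about that value.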

In an embodiment, training the probabilistic neural network comprises determining an evidence lower bound between the predicted output and labels of the training data set. For example, the predicted output can be a plurality of values and the labels represent a plurality of values representing a statistical summary determined from user responses obtained from users presented with images. The training process performs backpropagation based on the evidence lower bound.

In an embodiment, determining the measure of uncertainty for each of the plurality of values is performed by generating a plurality of samples for the input image using the probabilistic neural network and determining a confidence interval for each of the plurality of values using the plurality of samples.
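As a non-limiting illustration, a confidence interval for each output value may be derived from the samples as follows. The nearest-rank percentile method and the default 95% level are assumptions chosen for illustration; other interval estimators may be used.

```python
from typing import List, Sequence, Tuple


def percentile_interval(samples: Sequence[float], level: float = 0.95) -> Tuple[float, float]:
    """Return a (low, high) interval containing roughly `level` of the samples."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0.0 < level < 1.0:
        raise ValueError("level must be strictly between 0 and 1")
    ordered = sorted(samples)
    tail = (1.0 - level) / 2.0
    last = len(ordered) - 1
    # Nearest-rank lookup of the lower and upper percentiles.
    return ordered[round(tail * last)], ordered[round((1.0 - tail) * last)]


def intervals_per_value(
    output_samples: Sequence[Sequence[float]], level: float = 0.95
) -> List[Tuple[float, float]]:
    """Compute one interval per output value across all sampled outputs."""
    return [percentile_interval(column, level) for column in zip(*output_samples)]


# Three sampled outputs, each with two values (hypothetical numbers).
samples = [[0.1, 0.9], [0.3, 0.7], [0.2, 0.8]]
print(intervals_per_value(samples, level=0.9))
```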

In an embodiment, the autonomous vehicle navigates to ensure that the autonomous vehicle stays at least a threshold distance away from the traffic entity displayed in the new image. The threshold distance is determined based on the measure of uncertainty generated by the probabilistic neural network. For example, the threshold distance is determined to be a value directly related to the measure of uncertainty generated by the probabilistic neural network.
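As a non-limiting illustration, a threshold distance that is directly related to the measure of uncertainty may be computed as follows. The linear relationship, the base distance, the gain, and the upper bound are all hypothetical values chosen for illustration.

```python
def threshold_distance(
    uncertainty: float,
    base_distance_m: float = 1.5,
    gain_m: float = 4.0,
    max_distance_m: float = 6.0,
) -> float:
    """Map a non-negative uncertainty to a minimum clearance in meters.

    The distance grows with the uncertainty (a hypothetical linear rule)
    and is capped so the planner is never asked for an unbounded clearance.
    """
    if uncertainty < 0.0:
        raise ValueError("uncertainty must be non-negative")
    return min(base_distance_m + gain_m * uncertainty, max_distance_m)


for u in (0.0, 0.25, 5.0):
    print(u, threshold_distance(u))
```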

The hidden context may represent a state of mind of a user represented by the traffic entity. The hidden context may represent a task that a user represented by the traffic entity is planning on accomplishing. The hidden context may represent a goal of a user represented by the traffic entity, wherein the user expects to achieve the goal within a threshold time interval. For example, the hidden context may represent a near term goal of the person represented by the traffic entity, for example, indicating that the person is likely to cross the street, or indicating that the person is likely to pick up an object (e.g., a wallet) dropped on the street but stay on that side of the street, or any other task that the person is likely to perform within a threshold time interval. The hidden context may represent a degree of awareness of the autonomous vehicle by a user represented by the traffic entity, for example, whether a bicyclist driving in front of the autonomous vehicle is likely to be aware that the autonomous vehicle is behind the bicycle.

In an embodiment, navigating the autonomous vehicle comprises generating signals for controlling the autonomous vehicle based on the motion parameters, the hidden context of a traffic entity, and the measure of uncertainty generated for the traffic entity by the probabilistic neural network. The generated signals are sent to the controls of the autonomous vehicle.

In an embodiment, the probabilistic neural network is a probabilistic convolutional neural network.

BRIEF DESCRIPTION OF FIGURES

Various objectives, features, and advantages of the disclosed subject matter can be more fully appreciated with reference to the following detailed description of the disclosed subject matter when considered in connection with the following drawings, in which like reference numerals identify like elements.

FIG. 1 is a system diagram of a networked system for predicting human behavior according to some embodiments of the invention.

FIG. 2 is the system architecture of a vehicle computing system that routes an autonomous vehicle based on prediction of hidden context associated with traffic objects according to an embodiment of the invention.

FIG. 3 is a system diagram showing a sensor system associated with a vehicle, according to some embodiments of the invention.

FIG. 4 is a flowchart showing a process of training a machine learning based model to predict hidden context information describing traffic entities, according to some embodiments of the invention.

FIG. 5 is a flowchart showing a process of predicting the state of mind of road users using a trained learning algorithm, according to some embodiments of the invention.

FIG. 6 is a diagram showing an example of an application of a context user prediction process in an automobile context, according to some embodiments of the invention.

FIG. 7 represents a flowchart illustrating the process of navigating the autonomous vehicle based on hidden context, according to an embodiment.

FIG. 8 represents a flowchart illustrating the process of generating a training dataset for training of the neural network illustrated in FIG. 2, according to an embodiment.

FIG. 9 represents a flowchart illustrating the process of training of the neural network illustrated in FIG. 1B, according to an embodiment.

FIG. 10 represents a flowchart illustrating the process of generating a training dataset for training of the neural network illustrated in FIG. 2, according to another embodiment.

FIG. 11 is a block diagram illustrating components of an example machine able to read instructions from a machine-readable medium and execute them in a processor (or controller).

DETAILED DESCRIPTION

Embodiments of the invention predict hidden context associated with traffic entities that determines behavior of these traffic entities in the traffic. A traffic entity represents an object in traffic, for example, a pedestrian, a bicycle, a vehicle, a delivery robot, and so on. Hidden context includes factors that affect the behavior of such traffic entities, for example, a state of mind of a pedestrian, a degree of awareness of the existence of the autonomous vehicle in the vicinity (for example, whether a bicyclist is aware of the existence of the autonomous vehicle in the proximity of the bicyclist), and so on. The system uses the hidden context to predict behavior of people near a vehicle in a way that more closely resembles how human drivers would judge the behavior.

In one embodiment, a group of users (or human observers) view sample images of traffic entities (such as pedestrians) near streets and/or vehicles and indicate or are measured for their understanding of how they believe the people will behave. These indicators or measurements are then used as a component for training a machine learning based model that predicts how people will behave in a real-world context. In other words, after being trained based on the reactions of human observers to sample images in a training environment, the machine learning based model predicts behavior of traffic entities in a real-world environment, for example, actual pedestrian behavior in a real-world environment.

A non-stationary object may also be referred to as a movable object. An object in the traffic may also be referred to as an entity or a traffic entity.

Systems for predicting human interactions with vehicles are disclosed in U.S. patent application Ser. No. 15/830,549, filed on Dec. 4, 2017, which is incorporated herein by reference in its entirety.

System Environment

FIG. 1 is a system diagram of a networked system for predicting human behavior according to some embodiments of the invention. FIG. 1 shows a vehicle 102, a network 104, a server 106, a user response database 110, a client device 108, a model training system 112 and a prediction engine 114.

The vehicle 102 can be any type of manual or motorized vehicle such as a car, bus, train, scooter, or bicycle. As described in more detail below, the vehicle 102 can include sensors for monitoring the environment surrounding the vehicle. In one implementation, the sensors can include a camera affixed to any portion of the vehicle for capturing a video of people near the vehicle.

The network 104 can be any wired and/or wireless network capable of receiving sensor data collected by the vehicle 102 and distributing it to the server 106, the model training system 112, and, through the model training system 112, the prediction engine 114. In an embodiment, the prediction engine 114 comprises a neural network 120.

The server 106 can be any type of computer system capable of (1) hosting information (such as image, video and text information) and delivering it to a user terminal (such as client device 108), (2) recording responses of multiple users (or human observers) to the information, and (3) delivering such information and accompanying responses (such as responses input via client device 108) back to the network 104.

The user response database 110 can be any type of database or data storage system capable of storing the image, video, and text information and associated user responses and subsequently recalling them in response to a query.

The model training system 112 can be implemented in any type of computing system. In one embodiment, the system 112 receives the image, video, and/or text information and accompanying, or linked, user responses from the database 110 over the network 104. In some embodiments, the text segments are discrete values or free text responses. The model training system 112 can use images, video segments and text segments as training examples to train an algorithm, and can create labels from the accompanying user responses based on the trained algorithm. These labels indicate how the algorithm predicts the behavior of the people in the associated image, video, and/or text segments. After the labels are created, the model training system 112 can transmit them to the prediction engine 114.

The prediction engine 114 can be implemented in any computing system. In an illustrative example, the prediction engine 114 executes a process that executes a model that has been trained by the model training system 112. This process estimates a label for a new (e.g., an actual “real-world”) image, video, and/or text segment based on the labels and associated image, video, and/or text segments that it received from the model training system 112. In some embodiments, this label comprises aggregate or summary information about the responses of a large number of users (or human observers) presented with similar image, video, or text segments while the algorithm was being trained.

In an embodiment, the prediction engine 114 uses machine learning based models for predicting hidden context values associated with traffic entities. In an embodiment, the machine learning based model is a neural network configured to receive an encoding of an image or a video of a traffic entity as input and predict hidden context associated with the traffic entity. Examples of traffic entities include pedestrians, bicyclists, or other vehicles. Examples of hidden context include awareness of a bicyclist that a particular vehicle is driving close to the bicyclist, and intent of a pedestrian, for example, intent to cross a street, intent to continue walking along a sidewalk, and so on.

FIG. 2 is the system architecture of the neural network used for prediction of hidden context associated with traffic entities, according to some embodiments of the invention. The neural network 120 is a deep neural network comprising a plurality of layers of nodes. The layers of the neural network comprise an input layer that receives the input values, an output layer that outputs the result of the neural network, and one or more hidden layers. Each hidden layer receives input from a previous layer and generates values that are provided as input to a subsequent layer. In an embodiment, the neural network 120 is a probabilistic neural network. In an embodiment, the neural network 120 is a convolutional neural network, for example, a probabilistic convolutional neural network. In an embodiment, the neural network 120 is an LSTM (long short-term memory) network.

The neural network 120 receives an encoding of an image or video as input 123. The neural network is configured to predict estimates of measures of uncertainty for hidden context attributes. The input 123 comprises stored images or videos provided as training data during a training phase of the neural network 120. An image may represent a video frame. Once the neural network 120 is trained, the neural network 120 may be deployed in a vehicle, for example, an autonomous vehicle.

In an embodiment, the neural network 120 is a multi-task neural network configured to predict a plurality of output values representing different hidden context attributes. A multi-task neural network provides efficiency in training the model since the same model is able to predict multiple values. Accordingly, the process of training of the neural network as well as execution of the trained neural network is efficient in terms of performance. Furthermore, the sharing of the feature extraction component 125 across different prediction components 130 results in better training of the neural network.

The sensors of an autonomous vehicle capture sensor data 160 representing a scene describing the traffic surrounding the autonomous vehicle. The traffic includes one or more traffic entities, for example, a pedestrian. The autonomous vehicle provides sensor data as input to the neural network 120, for example, video frames of videos captured by cameras of the autonomous vehicle. In an embodiment, the input to the neural network 120 is a portion of a video frame that represents a bounding box around a traffic entity, for example, a pedestrian. In an embodiment, the input to the neural network is a sequence of bounding boxes surrounding the traffic entity obtained from a sequence of video frames showing the traffic entity, for example, in a video of a pedestrian captured as the pedestrian walks on a street. The autonomous vehicle uses the results of the neural network model to generate control signals for providing to the vehicle controls, for example, accelerator, brakes, steering, and so on, for navigating the autonomous vehicle through traffic.

The neural network 120 comprises components including a feature extraction component 125 and a plurality of prediction components 130 a, 130 b, 130 c, and so on. Each prediction component 130 predicts values for a particular hidden context attribute. For example, a prediction component 130 a may predict values describing intent of a pedestrian to perform a certain action (e.g., crossing the street), a prediction component 130 b may predict values describing awareness of a bicyclist of a vehicle following the bicyclist, and so on.

Each prediction component outputs two values associated with a hidden context attribute, a value 132 representing the predicted value of the hidden context attribute and a value 134 representing a measure of uncertainty associated with the predicted value 132. In an embodiment, the predicted value 132 represents parameters describing a statistical distribution of a hidden context attribute. In an embodiment, the predicted value 132 is a vector such that each value of the vector represents a likelihood that an observer would assign a particular value to the hidden context attribute. For example, the hidden context attribute may have a plurality of possible values v1, v2, v3, and so on and the predicted value 132 is a vector comprising probability values p1, p2, p3, and so on such that p1 represents a likelihood that an observer would assign value v1 to the hidden context attribute, p2 represents a likelihood that an observer would assign value v2 to the hidden context attribute, p3 represents a likelihood that an observer would assign value v3 to the hidden context attribute, and so on.

The predicted value 134 is a measure of uncertainty corresponding to the output 132. Accordingly, if the output 132 is a vector of multiple values, the output 134 is also a vector of multiple values, each value of the vector corresponding to output 134 representing a measure of uncertainty for the corresponding value in the vector corresponding to output 132.

As an example, the system requests user responses from users observing images of traffic entities, wherein each user response is one of a plurality of values. Each value from the plurality of values corresponds to a rating that the user provides to a hidden context of the traffic entity. For example, the user response may be a value from 1 to 5 (e.g., each user response is one value selected from the values 1, 2, 3, 4, and 5). The user response indicates what the user believes is the value of the hidden context attribute, for example, on a scale from 1 to 5, a number indicating how likely the user believes a pedestrian is to cross the street or how likely a bicyclist is to be aware of a vehicle. The output 132 represents a vector of values, each value corresponding to a possible user response and representing the likelihood that a user presented with the input image would provide that user response. The output 134 represents a vector of values, each value representing a measure of uncertainty for the corresponding value from output 132.
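As a non-limiting illustration, the vector of likelihoods that output 132 is trained to match may be derived from the user responses for one image as a normalized histogram over the rating scale. The function name and the sample responses are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def response_distribution(
    responses: Sequence[int], scale: Sequence[int] = (1, 2, 3, 4, 5)
) -> List[float]:
    """Return the fraction of observers who gave each rating on the scale."""
    if not responses:
        raise ValueError("at least one response is required")
    counts = Counter(responses)
    unknown = set(counts) - set(scale)
    if unknown:
        raise ValueError(f"responses outside the scale: {sorted(unknown)}")
    total = len(responses)
    # One entry per possible rating, in scale order; unused ratings get 0.0.
    return [counts[value] / total for value in scale]


# Eight hypothetical observers rating how likely a pedestrian is to cross.
print(response_distribution([1, 2, 2, 4, 5, 5, 5, 5]))
```

Each entry of the resulting vector corresponds to one possible user response, in the same order as the values v1, v2, v3, and so on described above.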

In an embodiment, the neural network 120 is a probabilistic neural network that may generate different outputs for the same input if the neural network is executed repeatedly. However, the outputs generated have a particular statistical distribution, for example, mean and standard deviation. The training process adjusts the parameters of the neural network so that the statistical distribution of the predicted output matches the statistical distribution of the labels in the training dataset. The statistical distribution is determined by parameters of the neural network that can be adjusted to generate different statistical distributions. In an embodiment, the feature extraction component generates features such that each feature value is associated with a statistical distribution, for example, mean and standard deviation values.

In an embodiment, the system trains the probabilistic neural network using stochastic variational inference techniques. The stochastic variational inference technique approximates a posterior likelihood (or probability) of data, given the probabilities that are determined using the random variables of the probabilistic neural network. The system adjusts the weights of the probabilistic neural network using back propagation. The system performs back propagation to optimize the evidence lower bound between the predicted values using images from the training data set and the actual values determined from the user responses obtained in response to observing images from the training data set. In an embodiment, the system determines the evidence lower bound using Kullback-Leibler divergence between the predicted values and the actual values obtained from user responses. The system uses Kullback-Leibler divergence to compare the statistical distribution of the predicted values with the statistical distribution of the actual values obtained from user responses.
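As a non-limiting illustration, the Kullback-Leibler divergence between the distribution of actual user responses and the predicted distribution may be computed for discrete distributions as follows. The direction of the divergence (actual relative to predicted), the clamping constant, and the sample values are assumptions made for illustration. An evidence lower bound used in variational inference typically also includes a term regularizing the distributions over the random variables of the network, which this sketch omits.

```python
import math
from typing import Sequence


def kl_divergence(actual: Sequence[float], predicted: Sequence[float], eps: float = 1e-12) -> float:
    """Compute KL(actual || predicted) for two discrete distributions.

    Terms where the actual probability is zero contribute nothing; predicted
    probabilities are clamped to `eps` to avoid division by zero.
    """
    if len(actual) != len(predicted):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for p, q in zip(actual, predicted):
        if p > 0.0:
            total += p * math.log(p / max(q, eps))
    return total


actual = [0.125, 0.25, 0.0, 0.125, 0.5]   # from user responses (hypothetical)
predicted = [0.1, 0.2, 0.1, 0.2, 0.4]     # network output (hypothetical)
print(kl_divergence(actual, predicted))
print(kl_divergence(actual, actual))      # identical distributions
```

The divergence is zero when the two distributions are identical and grows as the predicted distribution departs from the distribution of user responses.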

In an embodiment, the neural network 120 generates uncertainty estimate values corresponding to each of the plurality of possible values of the hidden context attribute. For example, the hidden context attribute values may be classified using a plurality of bins, each bin representing a range (or set) of values. The neural network 120 generates uncertainty estimate values for each bin.

FIG. 3 is a system diagram showing a sensor system associated with a vehicle, according to some embodiments of the invention. FIG. 3 shows a vehicle 306 with arrows pointing to the locations of its sensors 300, a local processor and storage 302, and remote storage 304.

Data is collected from cameras or other sensors 300 including solid state Lidar, rotating Lidar, medium range radar, or others mounted on the car in either a fixed or temporary capacity and oriented such that they capture images of the road ahead, behind, and/or to the side of the car. In some embodiments, the sensor data is recorded on a physical storage medium (not shown) such as a compact flash drive, hard drive, solid state drive or dedicated data logger. In some embodiments, the sensors 300 and storage media are managed by the processor 302.

The sensor data can be transferred from the in-car data storage medium and processor 302 to another storage medium 304 which could include cloud-based, desktop, or hosted server storage products. In some embodiments, the sensor data can be stored as video, video segments, or video frames.

FIG. 4 is a flowchart showing a process of training a machine learning based model to predict hidden context information describing traffic entities, according to some embodiments of the invention. In one implementation, video or other data is captured by a camera or sensor mounted on the vehicle 102. The camera or other sensor can be mounted in a fixed or temporary manner to the vehicle 102. Of course, the camera does not need to be mounted to an automobile, and could be mounted to another type of vehicle, such as a bicycle. As the vehicle travels along various streets, the camera or sensor captures still and/or moving images (or other sensor data) of pedestrians, bicycles, automobiles, etc. moving or being stationary on or near the streets. In step 402, this video or other data captured by the camera or other sensor is transmitted from the vehicle 102, over the network 104, and to the server 106 where it is stored.

Then, in step 404, video frames or segments are extracted from the stored video or other data and are used to create stimulus data including derived stimulus (or stimuli). In one implementation, the derived stimulus corresponds to a scene in which one or more humans are conducting activities (e.g., standing, walking, driving, riding a bicycle, etc.) beside or on a street and/or near a vehicle. As explained in more detail below, for example in step 414 and in the text accompanying FIG. 9, as part of the training process for the prediction algorithm, human observers view the derived stimulus and predict how they believe the humans shown in the derived stimulus will act. In yet a further implementation, after the video frames or segments are extracted from the stored data, the derived stimulus is generated by manipulating the pixels or equivalent array data acquired from the camera or other sensor in step 404, producing a new data file that conveys a portion of the information from the original video with certain aspects highlighted or obscured.

In step 406, the derived stimulus is transmitted from the server 106 and displayed to a large number of users (or human observers) on the client device 108 (or multiple client devices 108). The client device(s) 108 prompt the human observers to predict how the people shown in the derived stimulus will act, and upon viewing the displayed stimulus, the observers input their responses corresponding to their predictions. For example, the human observers may predict whether a bicyclist will continue riding, whether a first person in the stimulus will cross the street, whether another person will remain standing on a street corner, and whether yet another person will change lanes on his or her bicycle. In an illustrative embodiment, the human observers may make a continuous or ordinal judgment about the state of mind or the predicted behavior of the people shown in the derived stimulus and record that judgment. For example, the human observers may select an appropriate icon displayed on the client device(s) 108 by clicking a mouse or by pressing a key to indicate their judgment or prediction. The judgment or prediction may correspond to the human observers' assessment of the state of mind of the person in the derived stimulus or other awareness or intention that would be relevant to a hypothetical driver who sees the person in the derived stimulus while driving. In step 408, the derived stimulus and associated human observer responses are transmitted from the client device(s) 108 to the server 106 and recorded in the user response database 110.

In step 410, summary statistics are generated based on the user responses. For example, the statistics may characterize the aggregate responses of multiple human observers to a particular derived stimulus. For instance, if the derived stimulus shows a pedestrian walking on a sidewalk towards an intersection, the response can be categorized in terms of how many human observers believe that the pedestrian will stop upon reaching the intersection, continue walking straight across the intersection, turn a corner and continue walking along the sidewalk without crossing the intersection, etc. These summary statistics can characterize the human observer responses in terms of certain parameters associated with the statistics, such as a content of a response, a time associated with entering a response, and a position of an eye of a human observer associated with the response. The parameters can also be associated with a central tendency, variance, skew, kurtosis, scale, or histogram. For example, the amount of time users took to input their responses can be characterized in terms of central tendency, variance, skew, kurtosis, scale, or histogram. Also, the statistics can include a parameter that additionally or alternatively characterizes the movement of the human observers' eyes relative to a display when making the judgments in terms of central tendency, variance, skew, kurtosis, scale, histogram or two-dimensional distribution. In one embodiment, the statistics are stored in the user response database 110 with an index that identifies the raw video or sensor data from which the derived stimulus was generated. In a further embodiment, the statistics stored in the database 110 cover a large set of images of people on or near roads and are categorized in a number of different categories, such as pedestrian, driver, motorcyclist, bicyclist, scooter driver, self-balancing scooter rider, unicyclist, motorized wheelchair user, skateboarder, or others. Moreover, the statistics are respectively stored along with, or linked to, the images of the derived stimuli corresponding to the statistics.
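As a non-limiting illustration, summary statistics of the kind listed above may be computed for the observer responses to a single derived stimulus as follows. The use of population moments, excess kurtosis, and the dictionary layout are assumptions chosen for illustration.

```python
import math
import statistics
from collections import Counter
from typing import Dict, Sequence


def summarize_responses(values: Sequence[float]) -> Dict[str, object]:
    """Compute central tendency, variance, skew, kurtosis, and a histogram."""
    if not values:
        raise ValueError("at least one response is required")
    data = [float(v) for v in values]
    n = len(data)
    mean = statistics.fmean(data)
    variance = statistics.pvariance(data)  # population variance
    std = math.sqrt(variance)
    if std == 0.0:
        skew = kurtosis = 0.0  # all responses identical
    else:
        skew = sum(((v - mean) / std) ** 3 for v in data) / n
        # Excess kurtosis: zero for a normal distribution.
        kurtosis = sum(((v - mean) / std) ** 4 for v in data) / n - 3.0
    return {
        "mean": mean,
        "variance": variance,
        "skew": skew,
        "kurtosis": kurtosis,
        "histogram": dict(Counter(data)),
    }


# Hypothetical ratings from five observers for one derived stimulus.
print(summarize_responses([1, 2, 3, 4, 5]))
```

The same function could be applied to response times or other numeric parameters of the responses.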

In step 412, the stored statistics and corresponding images (e.g., the video frames or segments that were extracted from the video or other data (captured from the camera or sensor of the vehicle 102)) are sent over the network 104 to the model training system 112 and used to train a prediction algorithm. For example, the collection of images and statistics can be used to train a supervised learning algorithm, which can comprise a random forest regressor, a support vector regressor, a simple neural network, a deep convolutional neural network, a recurrent neural network, a long-short-term memory (LSTM) neural network with linear or nonlinear kernels that are two dimensional or three dimensional, or any other supervised learning algorithm which is able to take a collection of data labeled with continuous values and adapt its architecture in terms of weights, structure or other characteristics to minimize the deviation between its predicted label on a novel stimulus and the actual label collected on that stimulus using the same method as was used on the set of stimuli used to train that network. The model is given data which comprises some subset of the pixel data from the video frames that the summary statistics were generated from. In one implementation, this subset includes the pixel data contained in a bounding box drawn to contain the boundaries of the person, cyclist, motorist and vehicle, or other road user, including their mode of conveyance. In some other implementations, it also includes the entire pixel data from the rest of the image. In one of those implementations, that pixel data is selected according to criteria such as the salience of those features in terms of contrast, lighting, presence of edges, or color. In an additional implementation, the features can include descriptive meta-data about the images such as the dimensions and location of the bounding box, the shape of the bounding box or the change in size or position of the bounding box from one frame to the next.
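As a non-limiting illustration, descriptive meta-data about a bounding box and its change from one frame to the next may be computed as follows. The box representation (top-left corner plus width and height, in pixels) and the feature names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box: top-left corner plus size, in pixels."""
    x: float
    y: float
    width: float
    height: float


def box_features(prev: Box, curr: Box) -> Dict[str, float]:
    """Describe the current box and how it changed since the previous frame."""
    return {
        "width": curr.width,
        "height": curr.height,
        "aspect_ratio": curr.width / curr.height,  # shape of the box
        "dx": curr.x - prev.x,                     # change in position
        "dy": curr.y - prev.y,
        "area_change": curr.width * curr.height - prev.width * prev.height,
    }


# Hypothetical boxes around the same pedestrian in two consecutive frames.
print(box_features(Box(10.0, 20.0, 4.0, 8.0), Box(12.0, 20.0, 5.0, 10.0)))
```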

In step 414, the prediction engine 114 uses the trained model from the model training system 112 to predict the actual, “real-world” or “live data” behavior of people on or near a road. In one embodiment, the prediction engine 114 receives “live data” that matches the format of the data used to train the trained model. For example, if the trained model was trained based on video data received from a camera on the vehicle 102, the “live data” that is input to the algorithm likewise is video data from the same or similar type camera. On the other hand, if the model was trained based on another type of sensor data received from another type of sensor on the vehicle 102, the “live data” that is input to the prediction engine 114 likewise is the other type of data from the same or similar sensor.

The trained model or algorithm makes a prediction of what a pedestrianor other person shown in the “live data” would do based on the summarystatistics and/or training labels of one or more derived stimulus. Theaccuracy of the model is determined by having it make predictions ofnovel derived stimuli that were not part of the training imagespreviously mentioned but which do have human ratings attached to them,such that the summary statistics on the novel images can be generatedusing the same method as was used to generate the summary statistics forthe training data, but where the correlation between summary statisticsand image data was not part of the model training process. Thepredictions produced by the trained model comprise a set of predictionsof the state of mind of road users that can then be used to improve theperformance of autonomous vehicles, robots, virtual agents, trucks,bicycles, or other systems that operate on roadways by allowing them tomake judgments about the future behavior of road users based on theirstate of mind.

The server 106 generates derived stimuli from raw camera or sensor data of the vehicle for presenting to human observers. As described above, sensor data can include video segments or specific frames. These frames can either be contiguous or non-contiguous, and can be in the original order, in a permuted order, in reverse order, or in random order. Some of the frames can be repeated once or more than once.

Some of the frames may be manipulated. The frames can be manipulated by adjusting pixel values. These manipulations can include blurring, the addition of one or more occluding bars, bands, or shapes, sharpening, the removal of color information, the manipulation of color information, the drawing of non-occluding or highlighting shapes on the image, other manipulations, or any combination of the manipulations listed here with each other or with other manipulations of the pixels not listed. The manipulations serve the purpose of highlighting, occluding or degrading portions of the image, so that when the images are shown to the human observers, they are directed to people or specific portions of the image when predicting what the people in the images will do. For example, using the highlighting described above, a certain pedestrian in a scene can be isolated such that a human observer's feedback can be more reliably associated with the pedestrian. Frames may be recombined to form a derived stimulus. In some embodiments, if there is only one frame, that frame comprises the derived stimulus. If there is more than one frame, those frames may then be recombined.

Predictions and other information are collected from human observers based on derived stimuli. Human observers are given detailed instructions about how to answer questions about derived stimuli. Those observers are presented with derived stimuli and asked to answer questions about them. The observers respond to the stimuli and those responses are recorded. The recorded responses are aggregated and logged in a database, for example, the user response database 110.

Human observers are recruited to participate on one or several crowdsourcing websites, such as Amazon's Mechanical Turk, or at a physical location provided with a display. The observers are given detailed written and pictorial instructions explaining the task that they are about to complete. These instructions give examples of situations that might be depicted in the derived stimuli, and the kinds of responses that would be appropriate for those situations.

The human observers may be shown a display which includes the derived stimulus. The display also includes a mechanism for making a judgment about the stimulus. The mechanism for making the judgment can be a continuous indicator such as a ribbon on which the observer could drag a control to a certain point. The mechanism can also be an ordinal measure such as a Likert scale where the observer can make a judgment about a degree of certainty of the judgment. The mechanism can also be a control that the human observer drags with their mouse to draw a trajectory onscreen indicating a judgment. The mechanism can also be a text entry field where the observer types a description of their judgment.

The judgment that the human observer makes is a hidden context attribute that may represent an evaluation of the state of mind of a road user depicted in the derived stimulus. The evaluation can be of the intention, awareness, personality, state of consciousness, level of tiredness, aggressiveness, enthusiasm, thoughtfulness or another characteristic of the internal mental state of the pictured road user. If the ratings collected are on an ordinal scale, they can describe the characteristic using language of probability, such as “the other driver may be attentive” or “the other driver is definitely attentive” or “the other driver is definitely not attentive”.

The ratings of large numbers of human observers are collected. Summary statistics are generated based on the responses of all of the observers who looked at an image. Individual variability in responses to a given stimulus can be characterized in the information given by the observers to the learning algorithm. The summary statistics might include unweighted information from all observers, or might exclude observers based on extrinsic or intrinsic criteria such as the time it took an observer to respond, the geographical location of an observer, the observer's self-reported driving experience, or the observer's reliability in making ratings of a set of other images.

The explicit response of the observer is recorded as well as implicit data. The implicit data can include how long the subject took to respond, if they hesitated in their motions, if they deleted keystrokes, if they moved the mouse anywhere other than the location corresponding to the response they eventually chose, where their eyes moved, or other implicit measures.

The responses are aggregated and recorded in a data structure, such as the user response database 110. This data structure is then sent as a text field to a networked computer system running database software and logged in a database.

For each stimulus rated by each human observer, a response is recorded that could be a continuous, discrete, or ordinal value. This value may refer to the probability that the pictured human road user has a given state of mind, e.g., that a pedestrian is likely to cross the street or that an oncoming vehicle is unlikely to be willing to yield to the vehicle containing the sensor if the vehicle containing the sensor needs to turn. In some embodiments, a higher ordinal value (e.g., the ordinal 4 as shown in FIG. 6) indicates that a human observer believes that there is a higher probability that the pictured human road user has a given state of mind or will perform a particular action. On the other hand, a lower ordinal value (e.g., the ordinal 1) indicates that the human observer believes that there is a lower probability that the pictured human road user has the state of mind or will perform the particular action. Conversely, in some embodiments, a lower ordinal value can indicate a higher probability of an action, and a higher ordinal value can indicate a lower probability of an action.

An amount of time associated with a subject responding to the derived stimulus may also be recorded. In some embodiments, this time is associated with the overall reliability of the human observer's rating. For example, a response associated with a shorter response time may be weighted higher and a response associated with a longer response time may be weighted lower.
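The response-time weighting described above can be sketched as follows. This is a minimal illustration only: the inverse-time weighting rule and the function name weighted_mean_rating are assumptions made for the example and are not specified in this disclosure.

```python
def weighted_mean_rating(ratings, response_times_s):
    """Weighted mean of observer ratings; faster responses get larger weights.

    Assumption (hypothetical rule): the weight of a response is the inverse
    of its response time in seconds.
    """
    if not ratings or len(ratings) != len(response_times_s):
        raise ValueError("ratings and response times must be non-empty and of equal length")
    if any(t <= 0 for t in response_times_s):
        raise ValueError("response times must be positive")
    weights = [1.0 / t for t in response_times_s]
    total_weight = sum(weights)
    return sum(w * r for w, r in zip(weights, ratings)) / total_weight


# Three observers rate the same derived stimulus on an ordinal scale.
ratings = [4, 2, 3]
response_times_s = [1.0, 4.0, 2.0]
weighted = weighted_mean_rating(ratings, response_times_s)
```

In this example the fastest observer gave the highest rating, so the weighted mean lies above the unweighted mean of the three ratings.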

Summary statistics of a video frame or derived stimulus are generated. These summary statistics could include measurements of the central tendency of the distribution of scores like the mean, median, or mode. They could include measurements of the heterogeneity of the scores like variance, standard deviation, skew, kurtosis, heteroskedasticity, multimodality, or uniformness. They could also include summary statistics like those above calculated from the implicit measurements of the responses listed above. The calculated summary statistics are linked to the video frame or sensor data frame associated with the responses from which they were calculated.
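As an illustration of the summary statistics named above, the following sketch computes several of them for the ratings collected on one stimulus. The function name summarize_ratings and the use of population (rather than sample) variance are assumptions made for the example.

```python
import statistics


def summarize_ratings(ratings):
    """Central tendency and heterogeneity measures for one stimulus."""
    mean = statistics.mean(ratings)
    stdev = statistics.pstdev(ratings)  # population standard deviation (assumed)
    n = len(ratings)
    # Skew as the third standardized moment; zero when all ratings agree.
    if stdev > 0:
        skew = sum((r - mean) ** 3 for r in ratings) / n / stdev ** 3
    else:
        skew = 0.0
    return {
        "mean": mean,
        "median": statistics.median(ratings),
        "mode": statistics.mode(ratings),
        "variance": statistics.pvariance(ratings),
        "stdev": stdev,
        "skew": skew,
    }


# Ordinal ratings from eight observers for the same derived stimulus.
stats = summarize_ratings([1, 2, 2, 3, 4, 4, 4, 5])
```

The resulting dictionary could be stored alongside, or linked to, the frame from which the ratings were collected.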

The summary statistics are used for training machine learning based models. The machine learning based model may be any type of supervised learning algorithm capable of predicting a continuous label for a two or three dimensional input, including but not limited to a random forest regressor, a support vector regressor, a simple neural network, a deep convolutional neural network, a recurrent neural network, or a long short-term memory (LSTM) neural network with linear or nonlinear kernels that are two dimensional or three dimensional.

In one embodiment of the model training system 112, the machine learningbased model can be a deep neural network. In this embodiment theparameters are the weights attached to the connections between theartificial neurons comprising the network. Pixel data from an image in atraining set collated with human observer summary statistics serves asan input to the network. This input can be transformed according to amathematical function by each of the artificial neurons, and then thetransformed information can be transmitted from that artificial neuronto other artificial neurons in the neural network. The transmissionbetween the first artificial neuron and the subsequent neurons can bemodified by the weight parameters discussed above. In this embodiment,the neural network can be organized hierarchically such that the valueof each input pixel can be transformed by independent layers (e.g., 10to 20 layers) of artificial neurons, where the inputs for neurons at agiven layer come from the previous layer, and all of the outputs for aneuron (and their associated weight parameters) go to the subsequentlayer. At the end of the sequence of layers, in this embodiment, thenetwork can produce numbers that are intended to match the human summarystatistics given at the input. The difference between the numbers thatthe network output and the human summary statistics provided at theinput comprises an error signal. An algorithm (e.g., back-propagation)can be used to assign a small portion of the responsibility for theerror to each of the weight parameters in the network. The weightparameters can then be adjusted such that their estimated contributionto the overall error is reduced. This process can be repeated for eachimage (or for each combination of pixel data and human observer summarystatistics) in the training set. 
At the end of this process the model is “trained”, which in some embodiments means that the difference between the summary statistics output by the neural network and the summary statistics calculated from the responses of the human observers is minimized.
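The error-driven weight adjustment described above can be sketched with a deliberately small stand-in. The example below trains a single linear unit, not a deep network, by stochastic gradient descent on a squared error between its output and a human summary statistic. The function name, learning rate, and toy data are assumptions made for the illustration.

```python
def train_linear_unit(features, targets, learning_rate=0.1, epochs=1000):
    """Fit weights and bias so predictions match human summary statistics.

    Each update moves every parameter against its contribution to the
    squared error, which is what back-propagation reduces to for one layer.
    """
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in zip(features, targets):
            prediction = sum(w * xi for w, xi in zip(weights, x)) + bias
            error = prediction - target  # error signal for this example
            weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias


# Toy data: one feature per stimulus, target is the mean human rating.
features = [[0.0], [0.5], [1.0]]
mean_ratings = [1.0, 2.5, 4.0]
weights, bias = train_linear_unit(features, mean_ratings)
```

A deep network repeats the same idea across many layers, with the chain rule distributing the error signal to the weights of each layer.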

FIG. 5 is a flowchart showing a process of predicting the state of mind of road users using a trained learning algorithm, according to some embodiments of the invention. In step 500, the trained algorithm receives a “real world” or “live data” video or sensor frame. Then in step 502, the trained algorithm analyzes the frame, thus enabling the algorithm in step 504 to output a prediction of summary statistics on the frame.

The “real world” or “live data” video or other sensor frames from a car-mounted sensor are delivered to the trained learning algorithm in step 500. These frames have the same resolution, color depth and file format as the frames used to train the algorithm. These frames are delivered as individual frames or as sequences according to the format used to train the original algorithm.

Each of these frames is analyzed by being passed through the trained model in step 502. In one embodiment, the data from the frame that was passed through the model would comprise the pixel data from a camera. This data would be transformed by a trained artificial neural network. At the final stage of the processing in the artificial network, it would produce an output. This output is the model output in step 504.

The model outputs a number or set of numbers that comprise the predictedsummary statistics for the “real world” or “live data” image in step504. The predicted summary statistics are the model's best estimation ofwhat the summary statistics would be on the image if the image had humanannotations collected. The prediction is generated automatically bypassing the sensor data through the model, where the information istransformed by the internal mechanisms of the model according to theparameters that were set in the training process. Because these summarystatistics characterize the distribution of human responses that predictthe state of mind of a road user pictured in the stimulus, the predictedstatistics are therefore a prediction of the aggregate judgment of humanobservers of the state of mind of the pictured road user and thus anindirect prediction of the actual state of mind of the road user.

FIG. 6 is a diagram showing an example of an application of a context user prediction process in an automobile context, according to some embodiments of the invention. In this example, intention 606, 618 means that the road user 602, 614 has the goal of moving into the path of the vehicle 600 before the vehicle 600 (on which the system is mounted) reaches their position. Awareness 604, 616 in this example means that the road user 602, 614 understands that the vehicle 600 on which the system is mounted is present in their vicinity. In this example, when cyclist 602 rides into the field of view of a camera mounted on vehicle 600, the pixel data of the camera image of the cyclist is fed to a trained machine learning based model as described above in step 500. The trained machine learning based model analyzes the image as described above in step 502. The trained machine learning based model predicts summary statistics as in step 504. These summary statistics are an estimate of what the summary statistics would be for a collection of human observers who were shown a derived stimulus of the camera data. The estimated summary statistics are therefore the system's best answer to the question “does this cyclist intend to enter the path of the vehicle.” The vehicle is therefore able to make a guess 606 about the intention of the cyclist that is closely matched to the guess that a human driver would make in that same situation. In this example, the intention of the cyclist 606 is relatively high, as indicated by the number of horizontal bars in the display. The system installed on an automobile 600 also makes predictions about the awareness 604 of cyclists of the vehicle 600, by the same method described for intention. It also makes predictions about the willingness of an automobile 608 to yield 610 or its desire to turn across the system-containing vehicle's path 612 by the same method described above. In the case of the automobile, the questions that human subjects answered that would be predicted by the algorithm are “would the vehicle be willing to yield” 610 and “does the vehicle wish to turn across your path” 612. It also makes predictions about the likelihood of pedestrians 614 to cross in front of the vehicle 618, and whether those pedestrians are aware of the vehicle 616, by the same method described above.

The models described above may be implemented as a real-time module that makes predictions of behavior of traffic entities based on input from cameras or other sensors installed on a car 600. In the case of an autonomous car, these predictions may be used to make inferences about the intent of road users such as cyclists 602, other motorists 608, and pedestrians 614 to cross into the path of the car, as well as whether the road users are aware of the car and its future path. They can also be used to predict whether other road users would be surprised, welcoming, or aggressively unwelcoming if the car were to engage in maneuvers which would take it into the path of another road user (e.g., would an oncoming car yield if the car implementing the systems and methods described herein were to turn left).

Navigating Autonomous Vehicle Based on Hidden Context

The vehicle computing system 122 predicts hidden context representing intentions and future plans of a traffic entity. The hidden context may be used for navigating the autonomous vehicle, for example, by adjusting the path planning of the autonomous vehicle based on the hidden context. The vehicle computing system 122 may improve the path planning by using a machine learning based model that predicts the hidden context representing a level of human uncertainty about the future actions of pedestrians and cyclists and by providing that prediction as an input into the autonomous vehicle's motion planner. The training dataset of the machine learning models includes information about the ground truth of the world obtained from one or more computer vision models. The prediction engine 114 and the trained neural network 120 are provided to a vehicle computing system 122 of a vehicle (e.g., an autonomous vehicle) for execution at run time, for example, while navigating the vehicle through traffic. The vehicle computing system 122 may use the output of the prediction engine 114 to generate a probabilistic map of the risk of encountering an obstacle given different possible motion vectors at the next time step. Alternatively, the vehicle computing system 122 may use the output of the prediction engine 114 to determine a motion plan which incorporates the probabilistic uncertainty of the human assessment.

In an embodiment, the prediction engine 114 determines a metric representing a degree of uncertainty in human assessment of the near-term goal of a pedestrian or any user representing a traffic entity. The specific form of the representation of uncertainty is a model output that is in the form of a probability distribution, capturing the expected distributional characteristics of user responses of the hidden context of traffic entities responsive to the users being presented with videos/images representing traffic situations. The model output may comprise summary statistics of hidden context, i.e., the central tendency representing the mean likelihood that a person will act in a certain way and one or more parameters including the variance, kurtosis, skew, heteroskedasticity, and multimodality of the predicted human distribution. These summary statistics represent information about the level of human uncertainty.

In an embodiment, the vehicle computing system 122 determines the measure of uncertainty as a confidence interval for a predicted output 132. At inference time (i.e., execution time) for the neural network 120, the vehicle computing system 122 executes the neural network 120 to generate multiple samples per output value. For example, if the user response can be any one of a plurality of user response values, the probabilistic neural network generates an output value corresponding to each of the plurality of user response values. The vehicle computing system 122 executes the probabilistic neural network to generate m samples for each output value (where m is a value greater than 1). Accordingly, there are m probabilities generated per output value. The vehicle computing system 122 uses this collection of probabilities for each output value to determine the measure of uncertainty for that output value. In an embodiment, the vehicle computing system 122 generates a measure of uncertainty represented by a confidence interval. For example, the vehicle computing system 122 may generate a 95% confidence interval by considering the 2.5th and 97.5th percentile for each possible output value. Each output value may correspond to a bin representing a possible user response.
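The percentile-based interval described above can be sketched as follows, given m sampled probabilities per output bin. The function names and the linear-interpolation percentile rule are assumptions made for the illustration; the disclosure does not specify how percentiles are interpolated.

```python
def percentile(sorted_values, q):
    """Percentile q (0 to 100) of pre-sorted values, by linear interpolation."""
    if not sorted_values:
        raise ValueError("at least one sample is required")
    position = (len(sorted_values) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1.0 - fraction) + sorted_values[upper] * fraction


def confidence_intervals(samples_per_bin, level=95.0):
    """One (low, high) interval per output bin from its m sampled probabilities."""
    tail = (100.0 - level) / 2.0  # 2.5 for a 95% interval
    intervals = []
    for samples in samples_per_bin:
        ordered = sorted(samples)
        intervals.append((percentile(ordered, tail), percentile(ordered, 100.0 - tail)))
    return intervals


# Two output bins, each with m = 5 sampled probabilities.
samples_per_bin = [
    [0.10, 0.12, 0.08, 0.11, 0.09],
    [0.55, 0.70, 0.40, 0.62, 0.48],
]
intervals = confidence_intervals(samples_per_bin)
```

A wider interval for a bin indicates that the repeated executions disagreed more about the probability of that user response.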

The vehicle computing system 122 may determine the confidence interval by determining various statistical metrics from the generated set of output values corresponding to the generated samples and combining the statistical measures. For example, the vehicle computing system 122 may determine a mean value and a z-value and determine the confidence interval based on the mean value and the z-value. In an embodiment, the vehicle computing system 122 determines the confidence interval using the formula x ± z*σ/√n, where x is the mean value of the set, z is the z-value, σ is the standard deviation, and n is the number of observations. The vehicle computing system 122 may determine an N% confidence interval, where example values of N are 95, 90, or any other percentage value. The z-value depends on the value of N; for example, z is approximately 1.96 for a 95% confidence interval.
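A minimal sketch of the formula x ± z*σ/√n follows. It assumes that the sample standard deviation is used for σ and that the z-value is taken from the standard normal distribution; the function name is hypothetical.

```python
import math
import statistics


def z_confidence_interval(values, level=0.95):
    """Confidence interval mean ± z * sigma / sqrt(n) for sampled output values."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two observations are required")
    mean = statistics.mean(values)
    sigma = statistics.stdev(values)  # sample standard deviation (assumed)
    # Two-sided z-value, e.g. about 1.96 for level = 0.95.
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * sigma / math.sqrt(n)
    return mean - half_width, mean + half_width


# Four sampled probabilities for one output value.
low, high = z_confidence_interval([0.2, 0.4, 0.6, 0.8])
```

Lowering the level, for example to 0.90, yields a smaller z-value and therefore a narrower interval for the same samples.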

In an embodiment, the vehicle computing system 122 represents the hidden context as a vector of values, each value representing a parameter, for example, a likelihood that a person represented by a traffic entity is going to cross the street in front of the autonomous vehicle, a degree of awareness of the presence of the autonomous vehicle in the mind of a person represented by a traffic entity, and so on.

Overall Process of Navigating an Autonomous Vehicle Through Traffic

FIG. 7 represents a flowchart illustrating the process of navigating the autonomous vehicle based on hidden context, according to an embodiment. The vehicle computing system 122 receives 700 sensor data from sensors of the autonomous vehicle. For example, the vehicle computing system 122 may receive lidar scans from lidars and camera images from cameras mounted on the autonomous vehicle. In an embodiment, the vehicle computing system 122 builds a point cloud representation of the surroundings of the autonomous vehicle based on the sensor data. The point cloud representation includes coordinates of points surrounding the vehicle, for example, three dimensional points and parameters describing each point, for example, the color, intensity, and so on.

The vehicle computing system 122 identifies 702 one or more traffic entities based on the sensor data, for example, pedestrians, bicyclists, or other vehicles driving in the traffic. The traffic entities represent non-stationary objects in the surroundings of the autonomous vehicle.

In an embodiment, the autonomous vehicle obtains a map of the region through which the autonomous vehicle is driving. The autonomous vehicle may obtain the map from a server. The map may include a point cloud representation of the region around the autonomous vehicle. The autonomous vehicle performs localization to determine the location of the autonomous vehicle in the map and accordingly determines the stationary objects in the point cloud surrounding the autonomous vehicle. The autonomous vehicle may superimpose representations of traffic entities on the point cloud representation generated.

The vehicle computing system 122 repeats the following steps 704 and 706 for each identified traffic entity. The vehicle computing system 122 determines 704 motion parameters for the traffic entity, for example, speed and direction of movement of the traffic entity. The vehicle computing system 122 also determines 706 a hidden context associated with the traffic entity using the prediction engine 114. The vehicle computing system 122 determines a measure of uncertainty for the hidden context using the probabilistic neural network.

The vehicle computing system 122 navigates 708 the autonomous vehicle based on the motion parameters, the hidden context, and the measure of uncertainty. For example, the vehicle computing system 122 may determine a safe distance from the traffic entity that the autonomous vehicle should maintain based on the motion parameters of the traffic entity. The safe distance is also referred to as a threshold distance, such that the autonomous vehicle navigates to stay at least the threshold distance away from a traffic entity observed in the traffic. The vehicle computing system 122 modulates the safe distance based on the hidden context. The vehicle computing system 122 may adjust the safe distance based on whether the near-term goal of the person indicates that the person intends to reach a location in the direction of the movement of the traffic entity or in a different direction. For example, based on the motion parameters, the vehicle computing system 122 may determine that the autonomous vehicle can drive within X meters of the traffic entity. However, the hidden context indicates that the person represented by the traffic entity intends to cross the street in a direction different from the direction indicated by the motion parameters. In this situation, the vehicle computing system 122 adjusts the safe distance such that the autonomous vehicle is able to drive closer to the traffic entity than the distance X. On the other hand, if the hidden context indicates that the person represented by the traffic entity intends to cross the street in the same direction as the direction indicated by the motion parameters, the vehicle computing system 122 adjusts the safe distance such that the autonomous vehicle maintains a distance greater than X from the traffic entity.

The vehicle computing system 122 modulates the safe distance based on the measure of uncertainty for the output values. In cases where the measure of uncertainty is very high, the vehicle computing system 122 increases the safe distance from the traffic entity by a certain factor. In cases where the measure of uncertainty is very low, the vehicle computing system 122 does not adjust the safe distance from the traffic entity or increases the safe distance from the traffic entity by a much smaller factor. Accordingly, the vehicle computing system 122 determines the safe distance to be a value directly related to the measure of uncertainty generated by the probabilistic neural network. For example, the factor by which the safe distance is increased is a value directly proportional to a degree of uncertainty of the output.
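The proportional adjustment described above can be sketched as follows. The function name, the gain parameter, and the choice of the confidence interval width as the uncertainty measure are assumptions made for the illustration.

```python
def adjusted_safe_distance(base_distance_m, uncertainty, gain=1.0):
    """Increase the safe distance by a factor proportional to the uncertainty.

    base_distance_m: safe distance derived from motion parameters and hidden context.
    uncertainty: non-negative measure, e.g. the width of a confidence interval.
    gain: hypothetical tuning constant relating uncertainty to the increase.
    """
    if base_distance_m < 0 or uncertainty < 0 or gain < 0:
        raise ValueError("distance, uncertainty and gain must be non-negative")
    return base_distance_m * (1.0 + gain * uncertainty)


# Width of a 95% confidence interval for the "will cross" output value.
interval_low, interval_high = 0.35, 0.85
uncertainty = interval_high - interval_low
safe_distance_m = adjusted_safe_distance(10.0, uncertainty)
```

With zero uncertainty the base distance is returned unchanged, which matches the case in which the safe distance is not adjusted.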

Overall Process of Training Neural Network

FIG. 8 represents a flowchart illustrating the process of generating a training dataset for training of the neural network illustrated in FIG. 1B, according to an embodiment. The steps illustrated in the flowchart may be performed in an order different from that illustrated in FIG. 8. For example, certain steps may be performed in parallel. The steps may be performed by modules other than those indicated herein.

The vehicle computing system 122 receives 800 sensor data from sensors of the autonomous vehicle. For example, the vehicle computing system 122 may receive lidar scans from lidars and camera images from cameras mounted on the autonomous vehicle. The vehicle computing system 122 provides the sensor data to the server 106, which provides the sensor data to the model training system 112. The server 106 identifies one or more traffic entities based on the sensor data, for example, pedestrians, bicyclists, or other vehicles driving in the traffic. The traffic entities represent non-stationary objects in the surroundings of the autonomous vehicle.

The server 106 determines stimuli for presenting to users. A stimulus may be an image, for example, a video frame showing a traffic entity. A stimulus may be a video, for example, a portion of a video showing a traffic entity. In an embodiment, the server 106 identifies a bounding box surrounding a video frame or image and may present only the portion of the video frame or image within the bounding box to a user.

The server 106 repeats the following steps for each stimulus. The server 106 presents 804 the stimulus to users with a request to provide user responses describing some hidden context attribute of the traffic entity, for example, state of mind of a pedestrian/bicyclist or a measure of awareness of a vehicle by a pedestrian or bicyclist. The stimuli are presented to users via a user interface, for example, via a webpage of a website. The server 106 repeats the steps of presenting 804 the stimulus and receiving 806 user responses for each of a plurality of users.

The server 106 determines 808 a statistical distribution of user responses. For example, the statistical distribution may comprise a mean value and a standard deviation value. The server 106 stores 810 the stimulus and corresponding statistical distribution of user responses as a training dataset for training of the neural network 120. The server 106 provides the training dataset to the model training system 112.

FIG. 9 represents a flowchart illustrating the process of training of the neural network illustrated in FIG. 2, according to an embodiment. The steps illustrated in the flowchart may be performed in an order different from that illustrated in FIG. 9. For example, certain steps may be performed in parallel. The steps may be performed by modules other than those indicated herein.

The model training system 112 performs the following steps to train the neural network 120. The model training system 112 repeats the steps for each image/video frame or video stored in the training dataset. The model training system 112 provides a video frame as input to the neural network 120. The video frame may be encoded, for example, as an array of pixel data. The data for each pixel may comprise the position of the pixel in the image and one or more values, for example, color of the pixel.

The model training system 112 executes the neural network 120 to generate outputs. The model training system 112 repeatedly executes the neural network 120. As part of execution, the feature extraction component 125 of the neural network 120 generates 915 a feature vector. The system determines 918 parameters of random variables based on the feature vector, for example, mean and standard deviation for a normal distribution. The system performs sampling 920 of the random variables to generate a matrix wherein the rows of the matrix are sampled vectors of values. The system determines probability values corresponding to each row, for example, by applying 922 a softmax function to each row.

Each prediction component 130 predicts 918 output values of the hidden context. The neural network 120 may generate a different output in each execution and the output values have a particular statistical distribution that is determined by the parameters of the neural network 120.

The model training system 112 determines 925 a statistical distribution of the predicted output values. The model training system 112 compares 930 the statistical distribution of the predicted output values with the statistical distribution of the user responses received by presenting the video frame to users as a stimulus. The model training system 112 may determine a loss function value based on a comparison of the statistical distribution of the predicted output values with the statistical distribution of the user responses. The model training system 112 adjusts the parameters of the neural network 120 by performing back propagation to minimize the loss function.
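One way to compare the two distributions is sketched below. The disclosure does not name a specific loss function; the Kullback-Leibler divergence over response bins is used here purely as an assumed example, and the function name is hypothetical.

```python
import math


def kl_divergence(human_probs, predicted_probs, eps=1e-12):
    """KL divergence from the predicted distribution to the human distribution.

    Both arguments are probabilities over the same response bins. The value is
    zero when the distributions match and grows as they diverge.
    """
    if len(human_probs) != len(predicted_probs):
        raise ValueError("distributions must cover the same bins")
    return sum(
        p * math.log(p / max(q, eps))
        for p, q in zip(human_probs, predicted_probs)
        if p > 0
    )


# Share of observers choosing each of five response bins, and the model output.
human = [0.05, 0.10, 0.20, 0.40, 0.25]
predicted = [0.10, 0.15, 0.25, 0.30, 0.20]
loss = kl_divergence(human, predicted)
```

During training, a value such as this would be the quantity that back propagation seeks to reduce.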

The trained neural network is provided to the vehicle computing system 122 for use in navigating the autonomous vehicle. The autonomous vehicle receives sensor data, for example, camera images/video frames as the autonomous vehicle drives through traffic. The vehicle computing system 122 identifies traffic entities in the images/video frames. The vehicle computing system 122 preprocesses the video frames if necessary, for example, to select a portion of the video frame showing a bounding box around the traffic entity and provides the preprocessed video frame to the neural network 120 as input. The vehicle computing system 122 executes the neural network to generate outputs indicating values of hidden context attributes and uncertainty values associated with the predicted outputs. The vehicle computing system 122 uses the output of the neural network 120 to navigate the autonomous vehicle through the traffic. In an embodiment, the vehicle computing system 122 uses the output generated by the neural network to generate control signals provided to the controls of the vehicle, for example, braking system, accelerator, steering, and so on.

FIG. 10 represents a flowchart illustrating the process of generating a training dataset for training the neural network illustrated in FIG. 2, according to another embodiment. The system receives an image 1002, for example, a video frame. The system provides the image to a feature extractor 1004, for example, the feature extraction component 125. The feature extractor produces a feature vector 1006 that has a plurality of elements.

The neural network 120 repeats the following steps N times if the neural network 120 is a multi-task neural network that produces outputs corresponding to N tasks. The feature vector is processed by a probabilistic dense layer 1008 of the neural network 120 (which may be part of the feature extraction component 125). The probabilistic dense layer 1008 of the neural network 120 produces a vector 1010 of mean values and a vector 1012 of deviation values. In other embodiments, the probabilistic dense layer 1008 may generate other types of statistical distribution parameters, depending on the type of random variable, for example, Normal, Laplace, Beta, Multinomial, Student-t, and so on.
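A minimal sketch of a layer that maps a feature vector to a vector of mean values and a vector of deviation values follows. The weight names, the layer sizes, and the use of a softplus function to keep the deviations positive are assumptions made for illustration; the disclosure does not specify how the layer is parameterized.

```python
import numpy as np

def probabilistic_dense(features, w_mean, b_mean, w_dev, b_dev):
    """Return (mean, deviation) vectors for a feature vector.

    Assumption: two linear maps, with softplus applied to the deviation branch.
    """
    mean = features @ w_mean + b_mean
    # softplus(x) = log(1 + exp(x)) keeps every deviation strictly positive.
    deviation = np.logaddexp(0.0, features @ w_dev + b_dev)
    return mean, deviation

# Hypothetical sizes: 8 features in, 5 output bins.
rng = np.random.default_rng(0)
features = rng.normal(size=8)
w_mean, b_mean = rng.normal(size=(8, 5)), np.zeros(5)
w_dev, b_dev = rng.normal(size=(8, 5)), np.zeros(5)
mean, deviation = probabilistic_dense(features, w_mean, b_mean, w_dev, b_dev)
```

For a multi-task network, the same call would be repeated once per task with a separate set of weights.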

The neural network 120 combines the vectors 1010 and 1012 to generate a vector 1014 of random variables, for example, a vector of normal distributions. The neural network 120 performs sampling from the vector of normal distributions to generate a matrix 1016 where each row represents a sampled vector of values. The neural network 120 applies a softmax function to the rows to produce probability values for each row. The neural network uses the probability values to generate a distribution 1018 of probabilities according to bins of outputs.
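The sampling and softmax steps can be sketched as follows. The example mean and deviation values are invented, and averaging the rows to obtain a single distribution over the bins is an assumption; the disclosure only states that the probability values are used to generate the distribution.

```python
import numpy as np

def sample_bin_probabilities(mean, deviation, m, rng):
    """Draw m vectors from independent normal distributions and softmax each row."""
    mean = np.asarray(mean, dtype=float)
    deviation = np.asarray(deviation, dtype=float)
    # Matrix of samples: one row per sampled vector, one column per output bin.
    samples = rng.normal(loc=mean, scale=deviation, size=(m, mean.shape[0]))
    # Row-wise softmax, shifted by the row maximum for numerical stability.
    shifted = samples - samples.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
mean = np.array([0.1, 0.5, 1.2, 0.4, -0.3])      # hypothetical mean vector
deviation = np.array([0.2, 0.3, 0.1, 0.4, 0.2])  # hypothetical deviation vector
probs = sample_bin_probabilities(mean, deviation, m=100, rng=rng)
# Assumed aggregation: average the sampled rows into one distribution over bins.
distribution = probs.mean(axis=0)
```

Each row of probs sums to one, so the averaged distribution also sums to one.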

In some embodiments, the neural network 120 also identifies confidence intervals 1020 when the trained model is executed, i.e., during inference time, for example, during navigation of a vehicle. The confidence interval may not be determined during training of the neural network 120. At inference time, the number of samples m is greater than 1, i.e., multiple samples are generated. Accordingly, m samples are generated per random variable. For example, if there are 5 random variables, i.e., 1 per output bin, the neural network 120 generates m probabilities per bin. From this collection of probabilities for a given bin, the neural network 120 can compute any percentile (since the probabilities are all estimates of the same number). For example, for a 95% confidence interval, the system takes the 2.5th and 97.5th percentiles for every individual bin.
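The percentile computation can be sketched as follows. The matrix of sampled probabilities is simulated here with a Dirichlet draw purely to make the example self-contained; in the described system it would come from the m samples generated by the neural network. The function name confidence_intervals is an assumption.

```python
import numpy as np

def confidence_intervals(bin_probabilities, level=0.95):
    """Per-bin confidence interval from an (m, bins) matrix of sampled probabilities."""
    # For level = 0.95 the tails are the 2.5th and 97.5th percentiles.
    tail = 100.0 * (1.0 - level) / 2.0
    lower = np.percentile(bin_probabilities, tail, axis=0)
    upper = np.percentile(bin_probabilities, 100.0 - tail, axis=0)
    return lower, upper

# Simulated stand-in for m = 200 sampled probability vectors over 5 bins.
rng = np.random.default_rng(0)
bin_probabilities = rng.dirichlet(alpha=[2.0, 4.0, 8.0, 4.0, 2.0], size=200)
lower, upper = confidence_intervals(bin_probabilities)
```

The width upper - lower for each bin is one possible measure of uncertainty for the corresponding output value.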

Computing Machine Architecture

FIG. 11 is a block diagram illustrating components of an example machine able to read instructions from a machine-readable medium and execute them in a processor (or controller). Specifically, FIG. 11 shows a diagrammatic representation of a machine in the example form of a computer system 1100 within which instructions 1124 (e.g., software) for causing the machine to perform any one or more of the methodologies discussed herein may be executed. In alternative embodiments, the machine operates as a standalone device or may be connected (e.g., networked) to other machines. In a networked deployment, the machine may operate in the capacity of a server machine or a client machine in a server-client network environment, or as a peer machine in a peer-to-peer (or distributed) network environment.

The machine may be a server computer, a client computer, a personal computer (PC), a tablet PC, a set-top box (STB), a personal digital assistant (PDA), a cellular telephone, a smartphone, a web appliance, a network router, switch or bridge, or any machine capable of executing instructions 1124 (sequential or otherwise) that specify actions to be taken by that machine. Further, while only a single machine is illustrated, the term “machine” shall also be taken to include any collection of machines that individually or jointly execute instructions 1124 to perform any one or more of the methodologies discussed herein.

The example computer system 1100 includes a processor 1102 (e.g., a central processing unit (CPU), a graphics processing unit (GPU), a digital signal processor (DSP), one or more application specific integrated circuits (ASICs), one or more radio-frequency integrated circuits (RFICs), or any combination of these), a main memory 1104, and a static memory 1106, which are configured to communicate with each other via a bus 1108. The computer system 1100 may further include a graphics display unit 1110 (e.g., a plasma display panel (PDP), a liquid crystal display (LCD), a projector, or a cathode ray tube (CRT)). The computer system 1100 may also include an alphanumeric input device 1112 (e.g., a keyboard), a cursor control device 1114 (e.g., a mouse, a trackball, a joystick, a motion sensor, or other pointing instrument), a storage unit 1116, a signal generation device 1118 (e.g., a speaker), and a network interface device 1120, which also are configured to communicate via the bus 1108.

The storage unit 1116 includes a machine-readable medium 1122 on which are stored instructions 1124 (e.g., software) embodying any one or more of the methodologies or functions described herein. The instructions 1124 (e.g., software) may also reside, completely or at least partially, within the main memory 1104 or within the processor 1102 (e.g., within a processor's cache memory) during execution thereof by the computer system 1100, the main memory 1104 and the processor 1102 also constituting machine-readable media. The instructions 1124 (e.g., software) may be transmitted or received over a network 1126 via the network interface device 1120.

While machine-readable medium 1122 is shown in an example embodiment to be a single medium, the term “machine-readable medium” should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, or associated caches and servers) able to store instructions (e.g., instructions 1124). The term “machine-readable medium” shall also be taken to include any medium that is capable of storing instructions (e.g., instructions 1124) for execution by the machine and that cause the machine to perform any one or more of the methodologies disclosed herein. The term “machine-readable medium” includes, but is not limited to, data repositories in the form of solid-state memories, optical media, and magnetic media.

Additional Considerations

For every flowchart presented herein in the figures, the steps illustrated in the flowchart may be performed in an order different from that illustrated in the figure. For example, certain steps may be performed in parallel. The steps may be performed by modules other than those indicated herein.

Although embodiments disclosed describe techniques for navigating autonomous vehicles, the techniques disclosed are applicable to any mobile apparatus, for example, a robot, a delivery vehicle, a drone, and so on.

The subject matter described herein can be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the structural means disclosed in this specification and structural equivalents thereof, or in combinations of them. The subject matter described herein can be implemented as one or more computer program products, such as one or more computer programs tangibly embodied in an information carrier (e.g., in a machine readable storage device) or in a propagated signal, for execution by, or to control the operation of, data processing apparatus (e.g., a programmable processor, a computer, or multiple computers). A computer program (also known as a program, software, software application, or code) can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program does not necessarily correspond to a file. A program can be stored in a portion of a file that holds other programs or data, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub programs, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers at one site or distributed across multiple sites and interconnected by a communication network.

The processes and logic flows described in this specification, including the method steps of the subject matter described herein, can be performed by one or more programmable processors executing one or more computer programs to perform functions of the subject matter described herein by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus of the subject matter described herein can be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit).

Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read only memory or a random access memory or both. The essential elements of a computer are a processor for executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto optical disks, or optical disks. Information carriers suitable for embodying computer program instructions and data include all forms of nonvolatile memory, including by way of example semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices); magnetic disks (e.g., internal hard disks or removable disks); magneto optical disks; and optical disks (e.g., CD and DVD disks). The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.

To provide for interaction with a user, the subject matter described herein can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device (e.g., a mouse or a trackball), by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well. For example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user can be received in any form, including acoustic, speech, or tactile input.

The subject matter described herein can be implemented in a computing system that includes a back end component (e.g., a data server), a middleware component (e.g., an application server), or a front end component (e.g., a client computer having a graphical user interface or a web browser through which a user can interact with an implementation of the subject matter described herein), or any combination of such back end, middleware, and front end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), e.g., the Internet.

It is to be understood that the disclosed subject matter is not limited in its application to the details of construction and to the arrangements of the components set forth in the following description or illustrated in the drawings. The disclosed subject matter is capable of other embodiments and of being practiced and carried out in various ways. Also, it is to be understood that the phraseology and terminology employed herein are for the purpose of description and should not be regarded as limiting.

As such, those skilled in the art will appreciate that the conception, upon which this disclosure is based, may readily be utilized as a basis for the designing of other structures, methods, and systems for carrying out the several purposes of the disclosed subject matter. It is important, therefore, that the claims be regarded as including such equivalent constructions insofar as they do not depart from the spirit and scope of the disclosed subject matter.

Although the disclosed subject matter has been described and illustrated in the foregoing exemplary embodiments, it is understood that the present disclosure has been made only by way of example, and that numerous changes in the details of implementation of the disclosed subject matter may be made without departing from the spirit and scope of the disclosed subject matter, which is limited only by the claims which follow.

We claim:
 1. A method for navigating autonomous vehicles, comprising: training a probabilistic neural network, the probabilistic neural network configured to perform steps comprising: receiving as input, an image of traffic, the image displaying a traffic entity belonging to the traffic, generating a feature vector for a plurality of features, the feature vector comprising values describing statistical distribution for each feature, and generating output representing hidden context for the traffic entity, the output comprising a plurality of values, each value representing a likelihood of receiving a particular user response from a user presented with the image; receiving a new camera image captured by a camera mounted on an autonomous vehicle navigating through traffic; executing the probabilistic neural network to generate output representing hidden context for at least a traffic entity displayed in the new image; determining a measure of uncertainty for each of the plurality of values; and navigating the autonomous vehicle to avoid the traffic entity displayed in the new image, the navigation based on at least the measure of uncertainty generated by the probabilistic neural network.
 2. The method of claim 1, wherein the values describing statistical distribution for each feature comprise a mean value and a standard deviation for the feature.
 3. The method of claim 1, wherein training the probabilistic neural network comprises determining evidence lower bound between the plurality of values predicted and a plurality of values determined from user responses obtained from users presented with images from the training dataset.
 4. The method of claim 1, wherein determining the measure of uncertainty for each of the plurality of values comprises: generating a plurality of samples for the input image using the probabilistic neural network; and determining the confidence interval for each of the plurality of input using the plurality of samples.
 5. The method of claim 1, wherein navigating the autonomous vehicle comprises ensuring that the autonomous vehicle stays at least a threshold distance away from the traffic entity displayed in the new image, the threshold distance determined based on the measure of uncertainty generated by the probabilistic neural network.
 6. The method of claim 1, wherein the threshold distance is determined to be a value directly related to the measure of uncertainty generated by the probabilistic neural network.
 7. The method of claim 1, wherein the hidden context represents a state of mind of a user represented by the traffic entity.
 8. The method of claim 1, wherein the hidden context represents a task that a user represented by the traffic entity is planning on accomplishing.
 9. The method of claim 1, wherein the hidden context represents a degree of awareness of the autonomous vehicle by a user represented by the traffic entity.
 10. The method of claim 1, wherein the hidden context represents a goal of a user represented by the traffic entity, wherein the user expects to achieve the goal within a threshold time interval.
 11. The method of claim 1, wherein navigating the autonomous vehicle comprises: generating signals for controlling the autonomous vehicle based on the motion parameters and the hidden context of each of the traffic entities; and sending the generated signals to controls of the autonomous vehicle.
 12. The method of claim 1, wherein the probabilistic neural network is a probabilistic convolutional neural network.
 13. A non-transitory computer readable storage medium storing instructions, that when executed by a processor, cause the processor to perform steps comprising: training a probabilistic neural network, the probabilistic neural network configured to perform steps comprising: receiving as input, an image of traffic, the image displaying a traffic entity belonging to the traffic, generating a feature vector for a plurality of features, the feature vector comprising values describing statistical distribution for each feature, and generating output representing hidden context for the traffic entity, the output comprising a plurality of values, each value representing a likelihood of receiving a particular user response from a user presented with the image; receiving a new camera image captured by a camera mounted on an autonomous vehicle navigating through traffic; executing the probabilistic neural network to generate output representing hidden context for at least a traffic entity displayed in the new image; determining a measure of uncertainty for each of the plurality of values; and navigating the autonomous vehicle to avoid the traffic entity displayed in the new image, the navigation based on at least the measure of uncertainty generated by the probabilistic neural network.
 14. The non-transitory computer readable storage medium of claim 13, wherein the values describing statistical distribution for each feature comprise a mean value and a standard deviation for the feature.
 15. The non-transitory computer readable storage medium of claim 13, wherein training the probabilistic neural network comprises determining evidence lower bound between the plurality of values predicted and a plurality of values determined from user responses obtained from users presented with images from the training dataset.
 16. The non-transitory computer readable storage medium of claim 13, wherein determining the measure of uncertainty for each of the plurality of values comprises: generating a plurality of samples for the input image using the probabilistic neural network; and determining the confidence interval for each of the plurality of input using the plurality of samples.
 17. The non-transitory computer readable storage medium of claim 13, wherein navigating the autonomous vehicle comprises ensuring that the autonomous vehicle stays at least a threshold distance away from the traffic entity displayed in the new image, the threshold distance determined based on the measure of uncertainty generated by the probabilistic neural network.
 18. The non-transitory computer readable storage medium of claim 10, wherein the threshold distance is determined to be a value directly related to the measure of uncertainty generated by the probabilistic neural network.
 19. The non-transitory computer readable storage medium of claim 10, wherein navigating the autonomous vehicle comprises: generating signals for controlling the autonomous vehicle based on the motion parameters and the hidden context of each of the traffic entities; and sending the generated signals to controls of the autonomous vehicle.
 20. A computer system comprising: a processor; and a non-transitory computer readable storage medium storing instructions that when executed by the processor, cause the processor to perform steps comprising: training a probabilistic neural network, the probabilistic neural network configured to perform steps comprising: receiving as input, an image of traffic, the image displaying a traffic entity belonging to the traffic, generating a feature vector for a plurality of features, the feature vector comprising values describing statistical distribution for each feature, and generating output representing hidden context for the traffic entity, the output comprising a plurality of values, each value representing a likelihood of receiving a particular user response from a user presented with the image; receiving a new camera image captured by a camera mounted on an autonomous vehicle navigating through traffic; executing the probabilistic neural network to generate output representing hidden context for at least a traffic entity displayed in the new image; determining a measure of uncertainty for each of the plurality of values; and navigating the autonomous vehicle to avoid the traffic entity displayed in the new image, the navigation based on at least the measure of uncertainty generated by the probabilistic neural network.